(EAI-375): Ingest snooty docs facets and meta #558

mongodben · 2024-11-14T14:39:32Z

Jira: https://jira.mongodb.org/browse/EAI-375

Changes

Ingest Snooty docs meta and facets fields as Page.metadata
Add chunking concurrency
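
As a rough illustration of the chunking concurrency change, here is a minimal sketch of running a chunker over pages with a bounded number of calls in flight. `mapWithConcurrency` and `chunkPage` are hypothetical names for this example, not the functions added in this PR, and the blank-line splitter is only a stand-in for the real chunker.

```typescript
// Hypothetical helper: run `fn` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // Same order as `items`, regardless of completion order.
}

// Stand-in chunker (assumption): splits the page body on blank lines.
async function chunkPage(page: { url: string; body: string }): Promise<string[]> {
  return page.body
    .split(/\n\s*\n/)
    .map((text) => text.trim())
    .filter((text) => text.length > 0);
}
```

Calling `mapWithConcurrency(pages, 2, chunkPage)` would chunk at most two pages at a time while preserving input order in the result.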

Notes

This PR ingests only the facets that the docs define inside pages, not those in the facets.toml files. There's a DOP ticket to capture the facets.toml files: https://jira.mongodb.org/browse/DOP-5182

Experiment Results

Initial experiment

The experiment compares the ingestion pipeline with the new Snooty metadata against the previous baseline. The results can be found here: mongodb-chatbot-retrieval/experiments/mongodb-chatbot-retrieval-snooty-metadata

The results show a very slight decrease in search quality from these changes:

BinaryNDCG@5 goes from 33.20% -> 33.05% (-0.15 percentage points)
F1@5 goes from 23.33% -> 22.88% (-0.45 percentage points)
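
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. It assumes the textbook definitions (binary relevance, log2 position discount, F1 as the harmonic mean of precision and recall over the top k results); the scorers used in the actual experiment may differ in details, and the function names are hypothetical.

```typescript
// Binary NDCG@k: a retrieved ID counts as relevant (gain 1) iff it is in `expected`.
// Assumes the IDs in `retrieved` are unique.
function binaryNdcgAtK(retrieved: string[], expected: string[], k: number): number {
  const relevant = new Set(expected);
  const topK = retrieved.slice(0, k);

  // DCG: each hit at 0-based rank i contributes 1 / log2(i + 2).
  let dcg = 0;
  topK.forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });

  // Ideal DCG: all relevant items (up to k) ranked first.
  const idealHits = Math.min(k, relevant.size);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);

  return idcg === 0 ? 0 : dcg / idcg;
}

// F1@k: harmonic mean of precision and recall over the top k results.
function f1AtK(retrieved: string[], expected: string[], k: number): number {
  const relevant = new Set(expected);
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  if (hits === 0) return 0;

  const precision = hits / topK.length;
  const recall = hits / relevant.size;
  return (2 * precision * recall) / (precision + recall);
}
```

For example, with retrieved results `["a", "x", "b", "y", "z"]` and expected results `["a", "b", "c"]`, precision@5 is 2/5 and recall@5 is 2/3, so `f1AtK` returns 0.5.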

Follow up experiment

After upgrading the embedding model and preprocessor, the results are even worse. They can be seen here in Braintrust.

BinaryNDCG@5 goes from 51.77% -> 46.93% (-4.84 percentage points)

Next Steps

Based on the results, I think we should not ingest and chunk this metadata. Instead, we should ingest it so it's present in the pages collection but isn't included in the embedded_content. I will create a follow-up PR for that.
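
A minimal sketch of that proposed split, assuming simplified shapes: the metadata stays on the stored page, while the text passed to the embedder is built from the body alone. `PageSketch`, `EmbeddedContentSketch`, and `toEmbeddedContentSketch` are hypothetical names, the blank-line splitter is a stand-in for the real chunker, and the `page.facets` key is made up for illustration.

```typescript
// Hypothetical, simplified shapes; the real page and embedded_content records have more fields.
interface PageSketch {
  url: string;
  body: string;
  metadata?: Record<string, unknown>;
}

interface EmbeddedContentSketch {
  sourceUrl: string;
  text: string;
}

// Chunk only the page body. `page.metadata` is deliberately never read here,
// so facet and meta values cannot end up in the text that gets embedded.
function toEmbeddedContentSketch(page: PageSketch): EmbeddedContentSketch[] {
  return page.body
    .split(/\n\s*\n/)
    .map((text) => text.trim())
    .filter((text) => text.length > 0)
    .map((text) => ({ sourceUrl: page.url, text }));
}

// The page record keeps its metadata for storage in the pages collection.
const examplePage: PageSketch = {
  url: "https://example.com/docs/page",
  body: "First paragraph.\n\nSecond paragraph.",
  metadata: { "page.facets": "genre:tutorial" }, // made-up key and value
};

const exampleChunks = toEmbeddedContentSketch(examplePage);
```

Here `examplePage.metadata` remains available on the stored page, while `exampleChunks` contains only body text.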

nlarew · 2024-11-18T16:54:44Z

packages/ingest-mongodb-public/src/sources/snooty/snootyAstToMd.ts

+  const facetAndMetaNodes = findAll(
+    node,
+    ({ name }) => name === "facet" || name === "meta"
+  ) as (SnootyFacetNode | SnootyMetaNode)[];


This works for now. Longer term, if we plan to maintain this Snooty AST glue code, it might be nice to move to Zod parsing for these instead of type casts.
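
To illustrate the difference between a cast and runtime validation, here is a minimal sketch that checks node shapes before narrowing their types. It uses hand-written type guards rather than Zod so that it has no dependencies; Zod schemas would express the same checks declaratively. The node shapes and all names ending in `Sketch` are assumptions for this example and may not match the real `SnootyFacetNode` and `SnootyMetaNode` types.

```typescript
// Assumed shapes, for illustration only.
interface FacetNodeSketch {
  name: "facet";
  options: { name: string; values: string };
}

interface MetaNodeSketch {
  name: "meta";
  options: Record<string, string>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isFacetNodeSketch(node: unknown): node is FacetNodeSketch {
  if (!isRecord(node) || node["name"] !== "facet") return false;
  const options = node["options"];
  return (
    isRecord(options) &&
    typeof options["name"] === "string" &&
    typeof options["values"] === "string"
  );
}

function isMetaNodeSketch(node: unknown): node is MetaNodeSketch {
  if (!isRecord(node) || node["name"] !== "meta") return false;
  const options = node["options"];
  return (
    isRecord(options) &&
    Object.keys(options).every((key) => typeof options[key] === "string")
  );
}

// Unlike `as (FacetNodeSketch | MetaNodeSketch)[]`, malformed nodes are dropped
// at runtime instead of being silently trusted.
function parseFacetAndMetaNodesSketch(
  nodes: unknown[]
): (FacetNodeSketch | MetaNodeSketch)[] {
  return nodes.filter(
    (node): node is FacetNodeSketch | MetaNodeSketch =>
      isFacetNodeSketch(node) || isMetaNodeSketch(node)
  );
}
```

Whether malformed nodes should be dropped, logged, or treated as errors is a design choice; this sketch drops them.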

ingest snooty docs facets and meta

739e4a9

mongodben marked this pull request as draft November 14, 2024 14:44

mongodben added 3 commits November 14, 2024 10:29

page prefix on keys

a64df92

remove trailing/leading whitespace

154128e

Support concurrent embedding

a9f6fdb

mongodben marked this pull request as ready for review November 15, 2024 14:49

mongodben added the "DO NOT MERGE" label (Not yet ready for merge) Nov 15, 2024

nlarew reviewed Nov 18, 2024

nlarew approved these changes Nov 18, 2024

Merge remote-tracking branch 'upstream/main' into EAI-375

a1410c1

mongodben closed this Dec 10, 2024

mongodben mentioned this pull request Dec 10, 2024

(EAI-375 lite): Include page metadata in ingest, but not chunking #576

Merged
